multi cluster support #44
base: main
Conversation
IMO, we need more info about the dashboard: how to display the status of the Chaos Mesh instances in multiple clusters, and how to communicate with them.
That's awesome!
And maybe we should keep the name of "parent" consistent in future documentation, e.g. "Management Cluster".
> install or upgrade the chaos mesh to v2.2.0 in the target cluster. The
> `configOverride` allows to override the global configurations. The full
> The `configOverride` allows to override the global configurations.

So the content of `configOverride` would be similar to the Helm chart values? What happens if `Spec.configOverride` changes?
It seems that the best way is to run `helm upgrade`: we could compare the configs, and if the config has changed, upgrade the remote installation.
This feature is really needed, because we want the user to be able to update the config in the parent cluster and have it dispatched to the remote clusters automatically.
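If `configOverride` does mirror the Helm chart values, a `RemoteCluster` object might look roughly like this; a sketch in which everything under `spec` except `configOverride` (the `kubeConfig` field, names, and values) is an assumption, not taken from the RFC:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: RemoteCluster
metadata:
  name: cluster-a
spec:
  # assumed field: how the controller reaches the target cluster
  kubeConfig:
    secretRef:
      name: cluster-a-kubeconfig
  # overlay on top of the global Helm values; a change here would trigger
  # the compare-and-`helm upgrade` reconciliation discussed above
  configOverride:
    controllerManager:
      replicaCount: 1
    dashboard:
      create: false
```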
> The `Spec` defines the cluster which we want to install the chaos mesh, and the
> version of chaos mesh. It will install the same version of Chaos Mesh as the
> controller’s, which means if the version of controller is v2.2.0, it will
> install or upgrade the chaos mesh to v2.2.0 in the target cluster.
> if the version of controller is v2.2.0, it will install or upgrade the chaos mesh to v2.2.0 in the target cluster

IIUC, if the target RemoteCluster already contains a Chaos Mesh installation, we want to "align" its version to be the same as the "Parent Cluster"'s.
What would happen if "RemoteCluster version" > "ParentCluster version"? Maybe we need some version skew policy like https://kubernetes.io/releases/version-skew-policy/?
Is it possible that I want to use a patched version on the RemoteCluster? I think that would be important when developing and debugging/profiling.
> What would happen if "RemoteCluster version" > "ParentCluster version"? Maybe we need some version skew policy like https://kubernetes.io/releases/version-skew-policy/?

I don't think we will have enough energy to manage a long skew, so only the same (minor) version could be supported. I'll add this to the RFC.

> Is it possible that I want to use a patched version on the RemoteCluster? I think that would be important when developing and debugging/profiling.

You could use `configOverride` to override the image.
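For example, assuming `configOverride` is such a Helm-values overlay, a patched image on the remote cluster could be selected with something like this (the exact keys depend on the chart version and are illustrative only):

```yaml
configOverride:
  controllerManager:
    # illustrative: point the remote installation at a patched build
    image: ghcr.io/my-fork/chaos-mesh:v2.2.0-debug
```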
> 2. Why is the `remoteCluster` an extra field? Would a standalone resource / annotation / … be better?
>
> I can’t say which one is better. It’s a blind choice. I don’t want to create a new embedded standalone resource because the existing two embedded resources, Schedule and Workflow, are too huge (but that’s actually not strong evidence). I prefer fields over annotations because fields sound more like an official API, but annotations do not (or maybe we could start from annotations? I’m not sure).
I prefer to use annotations with `v1alpha1`. ❤️ No more changes on `api/v1alpha1`. 🤣🤣
Using annotations would cost more effort to implement the reconciler, but I think that would be OK/acceptable.
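For comparison, the annotation-based alternative could look roughly like this; the annotation key is hypothetical:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: burn-cpu
  annotations:
    # hypothetical key carrying what the `remoteCluster` field would hold
    chaos-mesh.org/remote-cluster: cluster-a
spec:
  mode: one
  selector:
    namespaces:
      - app-ns
  stressors:
    cpu:
      workers: 2
```

The reconciler would read the annotation instead of a typed field; that is the extra effort mentioned above, since annotations are not validated or defaulted by the schema.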
> For implementation, the chaos mesh will set up multiple “Managers”, one for each
> cluster. These managers will set up controllers to synchronize the things.
BTW, about the implementation, are these "Managers" actually controller-runtime Reconcilers?
> 4. Where will the webhook run?
>
> In the parent chaos mesh. Because we will not create chaos in the webhook (but in the reconcile), the validation should be done in the parent’s webhook.
I am not actually sure about that. I prefer to keep all the webhooks for both "ParentCluster" and "RemoteCluster" because of the possible version skew.
And it seems we need to prevent users from creating *Chaos in RemoteCluster?
> I am not actually sure about that. I prefer to keep all the webhooks for both "ParentCluster" and "RemoteCluster" because of the possible version skew.

It's fine to run duplicated webhooks multiple times (as our webhook doesn't have side effects now 😃), so it doesn't mean we couldn't run the webhook in the remote cluster.

> And it seems we need to prevent users from creating *Chaos in RemoteCluster?

I don't think so. Creating *Chaos in the RemoteCluster doesn't break the whole design. We have a label/annotation in the remote cluster to mark a resource as managed by the one in the parent cluster (and it's already needed, for GC).
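As a sketch, an object synchronized into the remote cluster could be marked like this (both keys are hypothetical, not from the RFC):

```yaml
metadata:
  name: burn-cpu
  labels:
    # hypothetical marker: lets controllers and GC distinguish synced
    # objects from *Chaos created directly in the remote cluster
    chaos-mesh.org/managed-by: parent
  annotations:
    # hypothetical back-reference to the owning object in the parent cluster
    chaos-mesh.org/parent-object: default/burn-cpu
```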
> will create a SubjectAccessReview to check whether the user is able to create
> this object in the target cluster/namespace.
>
> The only problem is that the username could be different among the clusters. To …
> The only problem is that the username could be different among the clusters.

Not only the username but also the user credential. It requires users to provide different credentials for different clusters. It seems we would save the credentials into `userMap`.
When the "Manager" syncs the *Chaos from the "ParentCluster" to the "RemoteCluster", it requires a kubeconfig with the user's credential. (Or maybe we just use the given kubeconfig 🤷‍♂️)
> Not only the username but also the user credential.

I think chaos mesh itself could use a super-powerful credential and impersonate anyone (with only the `username`) to create things.
Or is it possible to validate the authorization only in the parent cluster? The chaos controller in the parent cluster could also create a `SubjectAccessReview` in the remote cluster. I think that's better, as the authorization error would block the user from creating a resource he doesn't actually have the privilege to create.
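For reference, such a check is a standard Kubernetes `SubjectAccessReview`; a minimal sketch, where only the concrete names (user, namespace, resource) are illustrative:

```yaml
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  # the username being checked; as noted above, it may differ between clusters
  user: alice
  resourceAttributes:
    group: chaos-mesh.org
    resource: stresschaos
    verb: create
    namespace: app-ns
```

The API server answers in `.status.allowed`, so the controller, whether it runs the check from the parent or the remote side, can reject the object before dispatching it.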
> In the remote cluster, the chaos-controller-manager and chaos-daemon will be
> responsible for injecting / recovering chaos
>
> These two chaos-controller-manager could share a single binary, so the …
I don't think sharing a single binary is a good idea. The logic of `parent mode` is different from `child mode`; if we share a single binary, it will make chaos-controller-manager more complicated.
Well, it is not that different actually. It's just like how we can run the `Schedule` and `Workflow` controllers in a standalone binary: running them in the same binary doesn't bring too much trouble.
At least, I found I'll need to clarify the statement. It seems the three modes somewhat overlap: the remote-cluster chaos mesh doesn't need to know that it is a child, and the parent chaos mesh should also be able to inject chaos into the current (parent) cluster.
```yaml
metadata:
  name: burn-cpu
spec:
  remoteCluster:
```
Can `remoteCluster` be used as a selector instead of a single field?
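For illustration, the selector form in question might look like this (purely hypothetical; the RFC defines only a single field):

```yaml
spec:
  remoteCluster:
    selector:
      matchLabels:
        # hypothetical: target every registered RemoteCluster with this label
        region: us-west
```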
Indeed, we cannot assume that different clusters are isomorphic, so if you are trying to run the same selector on different remote clusters, I doubt whether it's meaningful. It's just like `containerNames`, which is also embarrassing, as its existence shows our assumption of isomorphism between pods.
And it's also hard to manage the status for a chaos in multiple different clusters.
```yaml
- type: Installed
  status: True
- type: Ready
  status: True
```
It seems that `Ready` already includes `Installed`.
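A possible reason to keep both is that they can diverge during a rollout; an illustrative intermediate state:

```yaml
conditions:
  - type: Installed
    status: True    # the release has been applied to the remote cluster
  - type: Ready
    status: False   # components are not healthy yet, so Ready lags Installed
```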
> The total architecture is really simple.
>
> In the management (or so-called parent) cluster, the chaos-controller-manager
> will be responsible for …
The parent cluster also manages the `PhysicalMachine`.
> As the plan shown above, we don't need too much modification on dashboard. The
> cluster installation status should be represented in `RemoteCluster`, and the
> execution status of a standalone chaos should be recorded in the `.Status` of …
Suggested change: "recorded in the `.Status` of" → "recorded in the `Status` of".
> chaos-controller-manager could work in three different modes:
>
> 1. parent mode, like discussed in former section
> 2. child mode, like discussed in former section
> 3. single mode, just the normal one we are using now
>
> (maybe these modes can be described through turning on/off some controllers / webhooks)
It seems there are no clear fields to describe the mode. And will we control the relationship between clusters? For example:

- Can a cluster be managed by different clusters? If not, how do we prevent this operation?
- Can a cluster be a parent and a child at the same time? If not, how do we prevent this operation?
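One hypothetical way to make the mode explicit (not defined in the RFC; the field name is invented) would be a Helm value consumed by chaos-controller-manager:

```yaml
# hypothetical Helm values; the RFC does not define such a field
controllerManager:
  # one of: parent | child | single
  clusterMode: parent
```

A `RemoteCluster` reconciler could then refuse to register a target that already reports `parent` mode, or one already marked as managed by another cluster, which would address both questions above.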